Quantitative exercise

By Group 1: Ya Ting Hu & Zhen Tian

Dataset 2 - Haberman Survival

  1. Title: Haberman's Survival Data

  2. Sources: (a) Donor: Tjen-Sien Lim (limt@stat.wisc.edu) (b) Date: March 4, 1999

  3. Past Usage:

    1. Haberman, S. J. (1976). Generalized Residuals for Log-Linear Models, Proceedings of the 9th International Biometrics Conference, Boston, pp. 104-122.
    2. Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984), Graphical Methods for Assessing Logistic Regression Models (with discussion), Journal of the American Statistical Association 79: 61-83.
    3. Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI.
  4. Relevant Information: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

  5. Number of Instances: 306

  6. Number of Attributes: 4 (including the class attribute)

  7. Attribute Information:

    1. Age of patient at time of operation (numerical)
    2. Patient's year of operation (year - 1900, numerical)
    3. Number of positive axillary nodes detected (numerical)
    4. Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years
  8. Missing Attribute Values: None

1. Do the pre-processing necessary to load the data into the analysis tool that is to be used for your project. While using R is recommended, the actual choice of tool is up to each of the groups.

We have decided to use Python in a Jupyter notebook for our project.

2. Load your data into the analysis tool.
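A minimal sketch of the loading step, using only the Python standard library. It assumes the raw file is comma-separated with no header row and four integer fields per record, in the attribute order listed above. The column names (`age`, `op_year`, `nodes`, `status`) and the file name `haberman.data` are our own labels, not part of the dataset description.

```python
import csv

# Column names are our own labels (an assumption); the raw file has no header row.
COLUMNS = ["age", "op_year", "nodes", "status"]


def load_haberman(lines):
    """Parse comma-separated records from any iterable of text lines."""
    rows = []
    for record in csv.reader(lines):
        if not record:  # csv.reader yields an empty list for a blank line
            continue
        if len(record) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(record)}")
        values = [int(field) for field in record]
        rows.append(dict(zip(COLUMNS, values)))
    return rows


# Usage, assuming the file is saved locally under the hypothetical name below:
# with open("haberman.data", newline="") as handle:
#     data = load_haberman(handle)
# The dataset description states 306 instances, so len(data) can be checked against that.
```

Taking an iterable of lines rather than a path keeps the parser easy to check on a few hand-written records before pointing it at the real file.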

3. Identify which are your independent variables and which are your dependent variables (in the data which you have collected or generated). Write a description of the expected properties of each of these variables.
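Following the attribute list, the three numerical attributes are the independent variables and the survival status is the dependent variable. The split can be sketched as below, assuming each record is a dict with the keys `age`, `op_year`, `nodes` and `status` (names of our choosing). Recoding the class to 1 = survived and 0 = died is a convenience for binary models, not something the dataset prescribes.

```python
FEATURES = ["age", "op_year", "nodes"]  # independent variables (our labels)
TARGET = "status"                       # dependent variable (class attribute)


def split_features_target(rows):
    """Return (X, y) with y recoded: status 1 -> 1 (survived), status 2 -> 0 (died)."""
    X, y = [], []
    for row in rows:
        if row[TARGET] not in (1, 2):
            raise ValueError(f"unexpected status code: {row[TARGET]!r}")
        X.append([row[name] for name in FEATURES])
        y.append(1 if row[TARGET] == 1 else 0)
    return X, y
```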

4. Perform exploratory data analysis. Does your data have the expected properties? If not, can you identify why it does not?
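One simple exploratory check is to compare summary statistics of each attribute between the two survival classes. The sketch below uses the standard `statistics` module and assumes each record is a dict with a `status` key (1 or 2) plus the numeric column of interest. The key names are our own labels.

```python
from statistics import mean, median, stdev


def summarize_by_status(rows, column):
    """Group one numeric column by survival status and summarize each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["status"], []).append(row[column])

    summary = {}
    for status, values in sorted(groups.items()):
        summary[status] = {
            "n": len(values),
            "min": min(values),
            "median": median(values),
            "mean": mean(values),
            "max": max(values),
            # The sample standard deviation needs at least two observations.
            "stdev": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Usage: summarize_by_status(data, "nodes") returns one summary dict per status code.
```

A large gap between mean and median within a group would suggest a skewed distribution, which is worth checking before choosing a model.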

5. Identify what statistical tools you will apply to analyze your data. If you have a model for the relationship between the independent and the dependent variables fit your data to this model. If you do not have a model, what could you do to identify a model, for example using principal component analysis?

OLS
Logit
Logistic
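To illustrate what a logistic model estimates, here is a from-scratch sketch that fits P(y = 1 | x) = sigmoid(b0 + b·x) by batch gradient ascent on the log-likelihood. It is not necessarily how the models listed above were fitted. The learning rate, iteration count and internal standardization are our assumptions, and the returned weights are on the standardized scale.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def standardize(X):
    """Center and scale each column; constant columns keep a scale of 1."""
    n, k = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(k)]
    stds = []
    for j in range(k):
        variance = sum((row[j] - means[j]) ** 2 for row in X) / n
        stds.append(math.sqrt(variance) or 1.0)
    Z = [[(row[j] - means[j]) / stds[j] for j in range(k)] for row in X]
    return Z, means, stds


def fit_logistic(X, y, learning_rate=0.1, iterations=2000):
    """Return (weights, predict_proba); weights[0] is the intercept."""
    Z, means, stds = standardize(X)
    n, k = len(Z), len(Z[0])
    weights = [0.0] * (k + 1)
    for _ in range(iterations):
        gradient = [0.0] * (k + 1)
        for z_row, target in zip(Z, y):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], z_row)))
            error = target - p  # gradient of the log-likelihood w.r.t. the logit
            gradient[0] += error
            for j in range(k):
                gradient[j + 1] += error * z_row[j]
        weights = [w + learning_rate * g / n for w, g in zip(weights, gradient)]

    def predict_proba(row):
        z_row = [(row[j] - means[j]) / stds[j] for j in range(k)]
        return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], z_row)))

    return weights, predict_proba
```

With `y` coded as 1 = survived, a negative weight on a feature means higher values of that feature are associated with lower estimated survival probability. A library implementation would additionally report standard errors and fit diagnostics.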

6. Generate some visual aids (such as tables or graphs) to present your data to others.

See the figures above.
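A plain-text table can complement graphs when presenting the data. The sketch below counts patients by node-count bin and survival status. The bin edges are an arbitrary choice of ours, and records are assumed to be dicts with `nodes` and `status` keys (our labels).

```python
def crosstab_nodes_by_status(rows, bin_edges=(0, 1, 4, 10)):
    """Count patients per node-count bin and status code (1 = survived, 2 = died)."""
    labels = []
    for i, low in enumerate(bin_edges):
        if i + 1 < len(bin_edges):
            high = bin_edges[i + 1] - 1
            labels.append(str(low) if low == high else f"{low}-{high}")
        else:
            labels.append(f"{low}+")  # the last bin is open-ended

    counts = {label: {1: 0, 2: 0} for label in labels}
    for row in rows:
        # Pick the last bin whose lower edge does not exceed the node count.
        index = max(i for i, low in enumerate(bin_edges) if row["nodes"] >= low)
        counts[labels[index]][row["status"]] += 1
    return counts


def format_table(counts):
    """Render the counts as fixed-width text, one bin per line."""
    lines = [f"{'nodes':<8}{'survived':>10}{'died':>8}"]
    for label, by_status in counts.items():
        lines.append(f"{label:<8}{by_status[1]:>10}{by_status[2]:>8}")
    return "\n".join(lines)


# Usage: print(format_table(crosstab_nodes_by_status(data)))
```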